
65 research outputs found

OmniLingo: Listening- and speaking-based language learning

Author: Howell Nicholas
Tyers Francis M.
Publication date: 10/10/2023

In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications, together with a demonstration client built on that architecture. The architecture is based on the InterPlanetary File System (IPFS) and puts user sovereignty over data at the forefront.

arXiv.org e-Print Archive

Data-Driven Morphological Analysis for Uralic Languages

Author: Silfverberg Miikka
Tyers Francis M.
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2019

This paper describes an initial set of experiments in data-driven morphological analysis of Uralic languages. The paper differs from previous work in that our work covers both lemmatization and generating ambiguous analyses. While hand-crafted finite-state transducers represent the state of the art in morphological analysis for most Uralic languages, we believe that there is a place for data-driven approaches, especially with respect to making up for lack of completeness in the lexicon. We present results for nine Uralic languages that show that, at least for basic nominal morphology for six out of the nine languages, data-driven methods can achieve an F-score of over 90%, providing results that approach those of finite-state techniques. We also compare our system to an earlier approach to Finnish data-driven morphological analysis (Silfverberg and Hulden, 2018) and show that our system outperforms this baseline. Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Towards an open-source universal-dependency treebank for Erzya

Author: Rueter Jack Michael
Tyers Francis M.
Publication date: 01/01/2018

This article describes the first steps towards an open-source dependency treebank for Erzya based on universal dependency (UD) annotation standards. The treebank contains 610 sentences with 6661 tokens and is based on texts from a range of open-source and public domain original Erzya sources. This ensures its free availability and extensibility. Texts in the treebank are first morphologically analyzed and disambiguated, after which they are annotated manually for dependency structure. In the article we present some issues in dependency syntax for Erzya and how they are analyzed in the universal-dependency framework. Preliminary statistics are given for dependency parsing of Erzya, along with points of interest for future research. Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

A morphological analyser for Maltese

Author: Gatt Albert
Ravishankar Vinit
Tyers Francis M.
Publication venue: Elsevier B.V.
Publication date: 01/01/2017

This article describes the development of a free/open-source morphological description of Maltese, originally created as the analysis component in a rule-based machine translation system for Maltese to Arabic and later applied to other tasks. The lexicon formalism we use is lttoolbox, part of the Apertium machine translation platform. An evaluation of the analyser shows that the coverage is adequate, at 84.90%, while precision is 92.5% on a large automatically annotated test set and 96.2% on a smaller hand-validated set. Peer reviewed

OAR@UM

Rule-based Breton to French machine translation

Author: Francis M Tyers
Publication date: 01/01/2010

This paper describes a rule-based machine translation system from Breton to French intended for producing gisting translations. The paper presents a summary of the ongoing development of the system, along with an evaluation of two versions, and some reflection on the use of MT systems for lesser-resourced or minority languages.

CiteSeerX

Keyword spotting for audiovisual archival search in Uralic languages

Author: Hjortnæs Nils
Partanen Niko
Tyers Francis M.
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2021

Publisher Copyright: © 2021 IWCLUL 2021 - 7th International Workshop on Computational Linguistics of Uralic Languages, Proceedings. All rights reserved. In this study we investigate the potential of using Automatic Speech Recognition (ASR) for keyword spotting for four Uralic languages: Finnish, Hungarian, Estonian and Komi. These languages also represent different points on the continuum from high-resource to low-resource languages. Although the accuracy of the ASR systems shows there is a long way to go, we show that they still have potential to be useful for downstream tasks such as keyword spotting. By using a simple text search after running ASR, we are already able to achieve an F1 score of between 0.15 and 0.33, a precision of nearly 0.90 for Estonian and Hungarian, and a precision of 0.76 for Komi. Peer reviewed

Helsingin yliopiston digitaalinen arkisto
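
The abstract above describes keyword spotting as a plain text search over ASR output, scored with precision and F1. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the utterance IDs, and the toy transcripts are all hypothetical. It also shows why a high precision can coexist with a much lower F1: F1 is the harmonic mean of precision and recall, so missed keywords pull it down.

```python
def spot_keyword(transcripts, keyword):
    """Return the IDs of utterances whose ASR transcript contains the keyword.

    `transcripts` maps an utterance ID to the (possibly erroneous) ASR text.
    """
    needle = keyword.lower()
    return {utt_id for utt_id, text in transcripts.items() if needle in text.lower()}


def precision_recall_f1(predicted, gold):
    """Score a set of predicted utterance IDs against the gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: in "u4" the ASR system misrecognised the keyword,
# so the text search misses it even though it was spoken.
asr_output = {
    "u1": "tämä on talo",
    "u2": "talo on suuri",
    "u3": "koira juoksee",
    "u4": "se on tallo",
}
gold = {"u1", "u2", "u4"}  # utterances in which "talo" was actually spoken

hits = spot_keyword(asr_output, "talo")
precision, recall, f1 = precision_recall_f1(hits, gold)
```

In this toy case every hit is correct, but one spoken instance is missed because of a recognition error, so recall and F1 fall below precision.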

A Free/Open-Source Morphological Analyser and Generator for Sakha

Author: Ivanova Sardana
Tyers Francis M.
Washington Jonathan
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/06/2022

We present, to our knowledge, the first ever published morphological analyser and generator for Sakha, a marginalised language of Siberia. The transducer, developed using HFST, has coverage of solidly above 90%, and high precision. In the development of the analyser, we have expanded linguistic knowledge about Sakha, and developed strategies for complex grammatical patterns. The transducer is already being used in downstream tasks, including computer-assisted language learning applications for linguistic maintenance and computational linguistic shared tasks. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

The Relevance of the Source Language in Transfer Learning for ASR

Author: Hjortnæs Nils
Partanen Niko
Rießler Michael
Tyers Francis M.
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2021

This study presents new experiments on Zyrian Komi speech recognition. We use DeepSpeech to train ASR models from a language documentation corpus that contains both contemporary and archival recordings. Earlier studies have shown that transfer learning from English and using a domain-matching Komi language model both improve the CER and WER. In this study we experiment with transfer learning from a more relevant source language, Russian, and with including Russian text in the language model construction. The motivation for this is that Russian and Komi are contemporary contact languages, and Russian is regularly present in the corpus. We found that, despite the close contact of Russian and Komi, the size of the English speech corpus yielded greater performance when used as the source language. Additionally, we can report that an update of the DeepSpeech version alone improved the CER by 3.9% compared with the earlier studies, which is an important step in the development of Komi ASR. Peer reviewed

Helsingin yliopiston digitaalinen arkisto
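
The abstract above reports results as character error rate (CER) and word error rate (WER). As a reminder of how these metrics are usually defined, here is a minimal sketch that computes both from the Levenshtein edit distance between a reference and an ASR hypothesis. It is an illustration under standard definitions, not the evaluation code used in the study, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, hyp_item in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both functions assume a non-empty reference. Because the denominator is the reference length, either rate can exceed 1.0 when the hypothesis contains many insertions.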

Automatic conversion of colloquial Finnish to standard Finnish

Author: Francis M Tyers
Inari Listenmaa
Publication date: 24/04/2020

This paper presents a rule-based method for converting between colloquial Finnish and standard Finnish. The method relies upon a small number of orthographical rules combined with a large language model of standard Finnish for ranking the possible conversions. Aside from this contribution, the paper also presents an evaluation corpus consisting of aligned sentences in colloquial Finnish, orthographically-standardised colloquial Finnish and standard Finnish. The method we present outperforms the baseline of simply treating colloquial Finnish as standard Finnish, but is outperformed by a phrase-based MT system trained on the evaluation corpus. The paper also presents preliminary results which show promise for using normalisation in the machine translation task.

CiteSeerX